A METHOD FOR RECOVERING 3D SCENE STRUCTURE AND CAMERA 
MOTION DIRECTLY FROM IMAGE INTENSITIES 



RELATED APPLICATION 
The present application is related to U.S. application Serial No. ______, titled "A
Method for Recovering 3D Structure and Camera Motion from Points, Lines and/or
Directly from the Image Intensities," filed on ______ by the same inventor as the
present application, which related application is incorporated herein by reference.

BACKGROUND OF THE INVENTION 
1. Field of the Invention 

The present invention relates generally to a method for recovering the camera 

motion and 3D scene structure and, more particularly, to a linear algorithm for recovering 

the structure and motion directly from the image intensities where the camera moves 

along a line. 
2. Prior Art 

The science of rendering a 3D model from information derived from a 2D image 

predates computer graphics, having its roots in the fields of photogrammetry and 

computer vision. 

Photogrammetry rests on the idea that when a picture is taken, the 3D
world is projected in perspective onto a flat 2D image plane. As a result, a feature in the
2D image seen at a particular point actually lies along a particular ray beginning at the
camera and extending out to infinity. By viewing the same feature in two different
photographs, the actual location can be resolved by constraining the feature to lie on the
intersection of two rays. This process is known as triangulation. Using this process, any
point seen in at least two images can be located in 3D. With a sufficient number of points,
it is also possible to solve for unknown camera positions. The techniques of
photogrammetry and triangulation were used in such applications as creating topographic
maps from aerial images. However, the photogrammetry process is time intensive and
inefficient.
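
The triangulation step described above can be illustrated with a minimal sketch. It assumes two known camera centers and one viewing ray per camera, and it returns the midpoint of the shortest segment joining the two rays, a common choice when noisy rays do not intersect exactly. The function name and the midpoint rule are illustrative assumptions and are not taken from this specification.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment joining two 3D rays.

    c1, c2 are camera centers; d1, d2 are ray directions (not necessarily unit).
    Hypothetical helper for illustration only.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    q1 = [ci + s * di for ci, di in zip(c1, d1)]
    q2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(q1, q2)]


# Two cameras one unit apart, both looking at the point (1, 2, 5).
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [1, 0, 0], [0, 2, 5])
```

When the two rays are nearly parallel, the denominator approaches zero and the depth estimate becomes unreliable, which is one reason a sufficient baseline between the photographs is needed.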

Computer vision techniques include recovering 3D scene structure from stereo
images, where correspondence between the two images is established automatically
via an iterative algorithm that searches for matches between points in
order to reconstruct a 3D scene. It is also possible to solve for the camera position and
motion using 3D scene structure from stereo images.

Current computer techniques are focused on motion-based reconstruction and are
a natural application of computer technology to the problem of inferring 3D structure
(geometry) from 2D images. This is known as Structure-from-Motion. Structure from
motion (SFM), the problem of reconstructing an unknown 3D scene from multiple 2D
images of it, is one of the most studied problems in computer vision.
SFM algorithms are currently known that reconstruct the scene from previously
computed feature correspondences, usually tracked points. Other algorithms are direct
methods that reconstruct from the images' intensities without a separate stage of
correspondence computation. The method of the present invention presents a direct
method that is non-iterative, linear, and capable of reconstructing from arbitrarily many
images. Previous direct methods were limited to a small number of images, required
strong assumptions about the scene (usually planarity), or employed iterative optimization
and required a starting estimate.

These approaches have complementary advantages and disadvantages. Usually
some fraction of the image data is of such low quality that it cannot be used to determine
correspondence. Feature-based methods address this problem by pre-selecting a few
distinctive point or line features that are relatively easy to track, while direct methods
attempt to compensate for the low quality of some of the data by exploiting the
redundancy of the total data. Feature-based methods have the advantage that their input
data is relatively reliable, but they neglect most of the available image information and
only give sparse reconstructions of the 3D scene. Direct methods have the potential to
give dense and accurate 3D reconstructions, due to their input data's redundancy, but
they can be unduly affected by large errors in a fraction of the data.

A method based on tracked lines is described in "A Linear Algorithm for Point
and Line Based Structure from Motion", M. Spetsakis, CVGIP 56:2 230-241, 1992,
where the original linear algorithm for 13 lines in 3 images was presented. An
optimization approach is disclosed in C.J. Taylor, D. Kriegmann, "Structure and Motion
from Line Segments in Multiple Images," PAMI 17:11 1021-1032, 1995. Additionally,
"A unified factorization algorithm for points, line segments and planes with
uncertainty models", K. Morris and T. Kanade, ICCV 696-702, 1998, describes work on
lines in an affine framework. A projective method for lines and points is described in
"Factorization methods for projective structure and motion", B. Triggs, CVPR 845-851,
1996, which involves computing the projective depths from a small number of frames.
In "In Defense of the Eight-Point Algorithm", PAMI 19, 580-593, 1995, Hartley presented a
full perspective approach that reconstructs from points and lines tracked over three
images.

The approach described in M. Irani, "Multi-Frame Optical Flow Estimation using
Subspace Constraints," ICCV 626-633, 1999 reconstructs directly from the image
intensities. The essential step of Irani for recovering correspondence is a multi-frame
generalization of the optical-flow approach described in B. Lucas and T. Kanade, "An
Iterative Image Registration Technique with an Application to Stereo Vision", IJCAI
674-679, 1981, which relies on a smoothness constraint and not on the rigidity constraint.
Irani uses the factorization of $D$ simply to fill out the entries of $D$ that could not be
computed initially. Irani writes the brightness constancy equation (equation (3) herein) in
matrix form as $\Delta = -DI$, where $D$ tabulates the shifts $d^i_n$ and $I$ contains the
intensity gradients $\nabla I(p_n)$. Irani notes that $D$ has rank 6 (for a camera with known
calibration), which implies that $\Delta$ must have rank 6. To reduce the effects of noise,
Irani projects the observed $\Delta$ onto one of rank 6. Irani then applies a multi-image form
of the Lucas-Kanade approach to recovering optical flow, which yields a matrix equation
$D I_2 = -\Delta_2$, where the entries of $I_2$ are squared intensity gradients $I_a I_b$ summed
over the "smoothing" windows, and the entries of $\Delta_2$ have the form $I_a \Delta I$. Due to
the added Lucas-Kanade smoothing constraint, the shifts $D$ or $d^i_n$ can be computed as
$D = -\Delta_2 [I_2]^{+}$, where $+$ denotes the pseudo-inverse, except in
smoothing windows where the image intensity is constant in at least one direction. Using
the rank constraint on $D$, Irani determines additional entries of $D$ for the windows where
the intensity is constant in one direction.

Any algorithm for small, linear motion confronts the aperture problem: the fact 
that the data within small image windows do not suffice to determine the correspondence 
unless one makes prior assumptions about the scene or motion. The aperture problem 
makes correspondence recovery a difficult and sometimes impossible global task. To 
avoid this, researchers typically impose a smoothness assumption. Lucas-Kanade uses a 
smoothing technique to address the aperture problem. 
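
The aperture problem can be made concrete with a minimal single-window sketch of the Lucas-Kanade normal equations. It sums products of intensity gradients over one smoothing window and solves the resulting 2x2 system for the shift; when all gradients in the window are parallel (the intensity is constant in one direction), the system is singular and no unique shift exists. The function name and the singularity threshold are illustrative assumptions.

```python
def lucas_kanade_shift(gradients, delta_i, eps=1e-9):
    """Solve the 2x2 Lucas-Kanade normal equations for one smoothing window.

    gradients: list of (Ix, Iy) per pixel; delta_i: intensity change per pixel.
    Returns (dx, dy), or None when the window is degenerate (aperture problem).
    """
    sxx = sum(ix * ix for ix, _ in gradients)
    sxy = sum(ix * iy for ix, iy in gradients)
    syy = sum(iy * iy for _, iy in gradients)
    bx = -sum(ix * di for (ix, _), di in zip(gradients, delta_i))
    by = -sum(iy * di for (_, iy), di in zip(gradients, delta_i))
    det = sxx * syy - sxy * sxy
    # Relative test: the determinant vanishes when all gradients are parallel.
    if det <= eps * (sxx + syy) ** 2:
        return None
    dx = (syy * bx - sxy * by) / det
    dy = (sxx * by - sxy * bx) / det
    return dx, dy


# Textured window: gradients point in several directions, so the shift is found.
textured = lucas_kanade_shift([(1, 0), (0, 1), (1, 1)], [-0.5, 0.25, -0.25])
# Edge-like window: every gradient is along x, so the shift is undetermined.
edge_only = lucas_kanade_shift([(1, 0), (2, 0), (3, 0)], [-0.5, -1.0, -1.5])
```

In the edge-like window only the shift component along the gradient is constrained, which is why a smoothness or rigidity assumption is needed to recover the remaining component.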

SUMMARY OF THE INVENTION 

The present invention is directed to a method for recovering 3D scene structure 
and camera motion from image data obtained from a multi-image sequence, wherein a 
reference image of the sequence is taken by a camera at a reference perspective and one 
or more successive images of the sequence are taken at one or more successive different 
perspectives by translating and/or rotating the camera, the method comprising the steps of:



(a) determining image data shifts for each successive image with respect to the 
reference image; the shifts being derived from the camera translation and/or rotation from 
the reference perspective to the successive different perspectives; 

(b) constructing a shift data matrix that incorporates the image data shifts for each
image;

(c) calculating a rank-1 factorization from the shift data matrix using SVD, with
one of the rank-1 factors being a vector corresponding to the 3D structure and the other
rank-1 factor being a vector corresponding to the size of the camera motions;

(d) dividing the successive images into smoothing windows; 

(e) recovering the direction of camera motion from the first vector corresponding 
to the 3D structure by solving a linear equation; and 

(f) recovering the 3D structure by solving a linear equation using the recovered 
camera motion. 

In accordance with the method of the present invention, step (e) includes:

computing a first projection matrix; 

recovering camera rotation vectors from the shift data matrix, and the first 
projection matrix; 

computing a second projection matrix; and 

recovering the direction of camera translation using the shift data matrix, the 
reference image, the second projection matrix and the recovered camera rotation 
vectors. 

In addition, step (f) includes recovering the 3D structure from the shift data matrix,
the reference image, the recovered camera rotation vectors and the recovered direction of
translation vectors.

The method of the present invention further includes preliminary steps of
recovering the rotations of the camera between each successive image; and warping all
images in the sequence toward the reference image, while neglecting the translations.

The present invention provides an algorithm for linear camera motion, where the 
camera moves roughly along a line, possibly with varying velocity and arbitrary 
rotations. The approach of the present invention applies for calibrated or uncalibrated 
cameras (the projective case). For specificity, we focus on the calibrated case, assuming
without loss of generality that the focal length is 1. The method is based on the brightness constancy
equation (BCE) and thus requires the motion and image displacements to be small 
enough so that the intensity changes between images can be modeled by derivatives at 
some resolution scale. 
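
The small-displacement requirement can be illustrated with a one-dimensional sketch of the brightness constancy equation, in which the shift is estimated as the negative intensity change divided by the intensity gradient. The intensity profile below is a hypothetical smooth function chosen only for illustration; it is not part of the method.

```python
import math


def reference_image(x):
    """Hypothetical smooth 1D intensity profile, used only for illustration."""
    return math.sin(0.3 * x)


def estimate_shift(x, true_shift, h=1e-3):
    """Estimate a 1D shift at position x from the brightness constancy equation."""
    # The later image is the reference profile displaced by true_shift.
    delta_i = reference_image(x - true_shift) - reference_image(x)
    # Central-difference gradient of the reference image at x.
    grad = (reference_image(x + h) - reference_image(x - h)) / (2 * h)
    # BCE in 1D: delta_i + grad * d = 0, so d = -delta_i / grad.
    return -delta_i / grad


small = estimate_shift(1.0, 0.05)  # close to the true shift of 0.05
large = estimate_shift(1.0, 5.0)   # noticeably off: derivatives no longer model the change
```

For the small shift the first-order model is accurate to well under one percent, while for the large shift the estimate degrades, which is why the displacements must be small at the chosen resolution scale.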

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other features, aspects, and advantages of the methods of the present invention 
will become better understood with regard to the following description, appended claims, 
and accompanying drawings where: 

FIG. 1 schematically illustrates a hardware implementation of the present invention. 
FIG. 2 is a block diagram that illustrates the method of the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
Definitions
The method of the present invention assumes that the 3D structure is to be
recovered from an image sequence consisting of $N_I$ images of fixed size, each with $N_p$
pixels. Let $p_n = (x_n, y_n)^T$ give the image coordinates of the $n$-th pixel position. Let $I^i$
denote the $i$-th image, with $i = 0, \ldots, N_I - 1$, and let $I^i_n = I^i(p_n)$ denote the image
intensity at the $n$-th pixel position in $I^i$. We take $I^0$ as the reference image. Let $P_n$
denote the 3D point imaged at $p_n$ in the reference image, with $P_n = (X_n, Y_n, Z_n)^T$ in the
coordinate system of $I^0$. Let $d^i_n$ denote the shift in image position from $I^0$ to $I^i$ of the
3D feature point $P_n$. The motion of the camera is described as its translation and
rotation. Let $T^i = (T^i_x, T^i_y, T^i_z)^T$ represent the camera translation between the reference
image and image $i$, and let $R^i$ denote the camera rotation. In accordance with the method
of the present invention we parameterize a small rotation by the rotational velocity
$\omega^i = (\omega^i_x, \omega^i_y, \omega^i_z)^T$. Let a 3D point $P$ transform as $P' = R(P - T)$. Let
$p^i_n = (x^i_n, y^i_n)^T = p_n + d^i_n$ be the shifted position in $I^i$ of $p_n \in I^0$ resulting from the
motion $T^i$, $R^i$.

Given a vector $V$, define $[V]_2$ as the length-2 vector consisting of the first two
components of $V$. Let $\underline{V}$ denote the 2D image point corresponding to the 3D point $V$:
$\underline{V} = [V]_2 / V_z$. For a 2D vector $v$, define the corresponding 3D point $\bar{v} = [v^T \; 1]^T$. $R * v$
denotes the image point obtained from $v$ after a rotation: $R * v = \underline{(R\bar{v})}$. Let $\hat{V} = V/|V|$.

The three rotational flows of the camera $r^{(1)}(x,y)$, $r^{(2)}(x,y)$, $r^{(3)}(x,y)$ are defined by

$$[r^{(1)}, r^{(2)}, r^{(3)}] \equiv \begin{bmatrix} -xy & 1 + x^2 & -y \\ -(1 + y^2) & xy & x \end{bmatrix}.$$

Let $\nabla I_n \equiv \nabla I(p_n)$ represent the (smoothed) gradient of the image intensities
$I^0(p_n)$ and define $(I_{xn}, I_{yn})^T \equiv \nabla I_n$. Similarly, let $\Delta I^i_n$ be the change in (smoothed)
intensity with respect to the reference image. With no smoothing, $\Delta I^i_n = I^i_n - I^0_n$. Let $\Delta$
be a $(N_I - 1) \times N_p$ matrix with entries $\Delta I^i_n$.

Suppose $V^a$ is a set of quantities indexed by the integer $a$. The notation $\{V\}$ is used to
denote the vector with elements given by the $V^a$. Let the $(N_I - 1) \times 3$ matrices
$T \equiv [\{T_x\} \; \{T_y\} \; \{T_z\}]$ and $W \equiv [\{\omega_x\} \; \{\omega_y\} \; \{\omega_z\}]$ encode all translations and rotational
velocities for a sequence.

Define the $(N_I - 1) \times (N_I - 1)$ matrix $C$ by $C_{ii'} \equiv \delta_{ii'} + 1$.
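
The role of the rotational flows can be checked numerically with a short sketch. It assumes a focal length of 1, a rotation obtained by exponentiating the rotational velocity (Rodrigues' formula), and the flows $r^{(1)} = (-xy, -(1+y^2))$, $r^{(2)} = (1+x^2, xy)$, $r^{(3)} = (-y, x)$; under these assumptions the first-order shift of an image point is the sum of the flows weighted by the components of the rotational velocity. The function names are hypothetical.

```python
import math


def rotational_flows(x, y):
    """The three rotational flows at image point (x, y), focal length 1 assumed."""
    return (-x * y, -(1 + y * y)), (1 + x * x, x * y), (-y, x)


def first_order_shift(omega, x, y):
    """Shift predicted by omega_x r1 + omega_y r2 + omega_z r3."""
    flows = rotational_flows(x, y)
    return tuple(sum(w * r[k] for w, r in zip(omega, flows)) for k in range(2))


def exact_shift(omega, x, y):
    """Shift of (x, y) after rotating the 3D point (x, y, 1) by exp([omega]x)."""
    theta = math.sqrt(sum(w * w for w in omega))
    if theta == 0.0:
        return (0.0, 0.0)
    k = [w / theta for w in omega]
    v = (x, y, 1.0)
    kxv = (k[1] * v[2] - k[2] * v[1],
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0])
    kdv = sum(a * b for a, b in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    # Rodrigues' rotation formula applied to v.
    r = [v[i] * c + kxv[i] * s + k[i] * kdv * (1 - c) for i in range(3)]
    return (r[0] / r[2] - x, r[1] / r[2] - y)


omega = (0.001, -0.002, 0.003)
approx = first_order_shift(omega, 0.2, -0.1)
exact = exact_shift(omega, 0.2, -0.1)
```

For rotations of a few milliradians the two shifts agree to second order in the rotation, which is the regime the residual rotations are assumed to be in after the initial compensation.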



Preliminary Analysis 

Before describing the method of the present invention, we shall describe the
preliminary analysis used to derive the translational and rotational flow vectors to be
applied in the algorithm. For small rotations and translations, the matrix of feature-shifts
is approximately bilinear in the unknown $T^i$, $R^i$, $Z_n^{-1}$. (We do not assume that the
rotations are small initially, but we can take them as small following their initial recovery
and compensation.) By contracting this matrix with the $\nabla I_n$, we get via the brightness
constancy equation (described later herein) a bilinear relation between the intensity
changes $\Delta I^i_n$ and the unknowns.

Derivation 

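The derivation of the flow vectors is described as follows. Up to noise, the
feature-shift $d^i_n$ can be written as

$$d^i_n = d^i_{Tn} + d^i_{\omega n}, \tag{1}$$

where

$$d^i_{Tn} \equiv \frac{Z_n^{-1}\left(T^i_z p_n - [T^i]_2\right)}{1 - Z_n^{-1} T^i_z}, \qquad d^i_{\omega n} \equiv p^i_n - (R^i)^{-1} * p^i_n.$$

Here $d^i_{\omega n}$ represents the rotational part of the shift and $d^i_{Tn}$ represents the translational part. When
there is zero rotation, $d^i_n = d^i_{Tn}$. One can rewrite $d^i_{\omega n} = R^i * p^i_{Tn} - p^i_{Tn}$, where
$p^i_{Tn} \equiv (R^i)^{-1} * p^i_n = p_n + d^i_{Tn}$. We assume small translations and small residual rotations.
Then $p^i_{Tn} \approx p_n + O(Z^{-1}\tau)$, and

$$d^i_n \approx Z_n^{-1}\left(T^i_z p_n - [T^i]_2\right) + \omega^i_x r^{(1)}(p_n) + \omega^i_y r^{(2)}(p_n) + \omega^i_z r^{(3)}(p_n) + O\left(Z^{-2}\tau^2, \omega^2, \omega Z^{-1}\tau\right), \tag{2}$$

where $Z^{-1}$, $\tau$, $\omega$ represent the
average sizes of the $Z_n^{-1}$, the translations and the residual rotations in radians. From "A
Linear Solution for Multiframe Structure from Motion", J. Oliensis, IUW 1225-1231,
1994 we get $\omega \approx Z^{-1}\tau$.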

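Then using the brightness constancy equation (BCE),

$$\Delta I^i_n + \nabla I_n \cdot d^i_n = 0, \tag{3}$$

which holds up to corrections of $O(Z^{-2}\tau^2, \omega^2, \eta)$, where $\eta$ gives the typical size
of the noise in the image intensities. The brightness constancy equation and (2) imply that

$$-\Delta I^i_n \approx Z_n^{-1}\left(\nabla I_n \cdot p_n \, T^i_z - \nabla I_n \cdot [T^i]_2\right) + \nabla I_n \cdot \left(\omega^i_x r^{(1)}(p_n) + \omega^i_y r^{(2)}(p_n) + \omega^i_z r^{(3)}(p_n)\right).$$

Then we define the three length-$N_p$ translational flow vectors as

$$\Phi_x \equiv -\{Z^{-1} I_x\}, \quad \Phi_y \equiv -\{Z^{-1} I_y\}, \quad \Phi_z \equiv \{Z^{-1}(\nabla I \cdot p)\},$$

and also define the three length-$N_p$ rotational flow vectors as

$$\Psi_x \equiv \{\nabla I \cdot r^{(1)}(p)\}, \quad \Psi_y \equiv \{\nabla I \cdot r^{(2)}(p)\}, \quad \Psi_z \equiv \{\nabla I \cdot r^{(3)}(p)\}.$$

Then let $\Phi \equiv [\Phi_x, \Phi_y, \Phi_z]$ and $\Psi \equiv [\Psi_x, \Psi_y, \Psi_z]$. Then $-\Delta \approx T\Phi^T + W\Psi^T$.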

Then we define $H$ as a $(N_p - 3) \times N_p$ matrix that annihilates the three
vectors $\Psi_x, \Psi_y, \Psi_z$ and satisfies $HH^T = 1_{N_p - 3}$, where $1_{N_p - 3}$ is the identity matrix. One
can then compute $H$, and products involving $H$, in $O(N_p)$ using Householder matrices,
which are described in "A Linear Solution for Multiframe Structure from Motion", J.
Oliensis, IUW 1225-1231, 1994, and "A Multi-frame Structure from Motion Algorithm
under Perspective Projection", J. Oliensis, IJCV 34:2/3, 163-192, 1999, and Workshop on
Visual Scenes, 77-84, 1995. It then follows that

$$-\Delta H^T \approx T\Phi^T H^T \tag{4}$$
up to $O(Z^{-2}\tau^2, \omega^2, \omega Z^{-1}\tau, \eta)$. In practice, we use equation (4) above left-multiplied by
$C^{-1/2}$, with $\Delta_{CH} \equiv C^{-1/2}\Delta H^T$. Multiplying by $C^{-1/2}$ reduces the bias due to singling out
the reference image for special treatment, a process described in "A Multi-frame
Structure from Motion Algorithm under Perspective Projection" which is referenced
above.

Equation (4) relates the data-matrix, on the left, to the translations and structure,
on the right. Multiplying by $H$ has eliminated the rotational effects up to second order.
These second order corrections include corrections of $O(\omega\eta)$, caused by errors in the
measured $\nabla I$ we use to define $H$. For small translations, $O(Z^{-1}\tau) \sim O(\omega)$ as described in
"Rigorous Bounds for Two-Frame Structure from Motion," J. Oliensis, IUW, 1225-1231,
1994, so all the corrections in equation (4) have similar
sizes: $O(Z^{-2}\tau^2) \sim O(\omega Z^{-1}\tau) \sim O(\omega^2)$. Therefore, multiplying by $H$ was crucial to reduce
the rotational corrections to the same order as the translational corrections.
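
The elimination of rotational effects can be sketched numerically. Rather than building $H$ from Householder matrices, the minimal sketch below applies the equivalent projector $H^T H = 1 - QQ^T$, where the columns of $Q$ are an orthonormal basis for the span of the three rotational flow vectors obtained by Gram-Schmidt; this substitution, the function names and the rotational-flow convention $r^{(1)} = (-xy, -(1+y^2))$, $r^{(2)} = (1+x^2, xy)$, $r^{(3)} = (-y, x)$ are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotational_flow_vectors(points, grads):
    """Return Psi_x, Psi_y, Psi_z: the gradient dotted with each rotational flow."""
    psi = ([], [], [])
    for (x, y), (ix, iy) in zip(points, grads):
        flows = ((-x * y, -(1 + y * y)), (1 + x * x, x * y), (-y, x))
        for column, (rx, ry) in zip(psi, flows):
            column.append(ix * rx + iy * ry)
    return psi


def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt; returns an orthonormal basis of the span."""
    basis = []
    for v in vectors:
        v = list(v)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([vi / norm for vi in v])
    return basis


def remove_rotation(row, basis):
    """Apply the projector 1 - Q Q^T to one length-N_p row of the data matrix."""
    out = list(row)
    for q in basis:
        c = dot(q, out)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out
```

Applying `remove_rotation` to a row of the form (translational term) + (rotational velocity) · (rotational flow vectors) gives the same result as applying it to the translational term alone, which is the first-order content of equation (4).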

Linear-Motion Algorithm 

The basic algorithm of the present invention for cases of linear camera motion is 
more particularly described as follows. 

0. Recover rotations and warp all images $I^1, \ldots, I^{N_I - 1}$ toward the reference
image $I^0$, while neglecting the translations. Let the image displacements $d^i_n$ now
refer to the unrotated images.

1. Compute $H$ and $\Delta_{CH}$. Using the singular value decomposition, compute
the best rank-1 factorization of $-\Delta_{CH} \approx M^{(1)} S^{(1)T}$, where $M^{(1)}$, $S^{(1)}$ are vectors. If
the leading singular value of $-\Delta_{CH}$ is much larger than the rest, this confirms that
the motion is approximately linear and that the signal dominates the noise so that
the algorithm has enough information to proceed. $C^{1/2}M^{(1)}$ gives the translation
magnitudes up to an overall multiplicative scale.

2. Divide the image into small smoothing windows and take $Z^{-1}$ as constant
within each window. List the pixels so that those in the $k$-th smoothing window
have sequential indices $\eta_k, (\eta_k + 1), \ldots, (\eta_{k+1} - 1)$. Then compute an $N_p \times N_p$
projection matrix $P_\Omega$ which is block diagonal with zero entries between different
smoothing windows, and which annihilates the vectors $\{\nabla I \cdot p\}$, $\{I_x\}$, and $\{I_y\}$.
Then solve the overconstrained system of equations

$$P_\Omega\left(H^T S^{(1)} - \Psi w\right) = 0 \tag{5}$$

for the 3-vector $w$.

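To complete the method of the Linear-Motion Algorithm, compute an $N_p \times N_p$
projection matrix $P_{\hat{T}}$, which is block diagonal with zero entries between different
smoothing windows and annihilates $H^T S^{(1)} - \Psi w$, where $w$ is the vector recovered
previously. Then solve for the direction of translation $\hat{T}$ via

$$P_{\hat{T}}\left(-\hat{T}_x\{I_x\} - \hat{T}_y\{I_y\} + \hat{T}_z\{p \cdot \nabla I\}\right) = 0. \tag{6}$$

Finally, recover $Z_n$ via

$$\left(H^T S^{(1)}\right)_n - [\Psi w]_n = Z_n^{-1}\left(\hat{T}_z p_n - [\hat{T}]_2\right) \cdot \nabla I_n. \tag{7}$$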
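
The rank-1 factorization of Step 1 and its dominance check can be sketched as follows. The specification uses the singular value decomposition; to keep the sketch dependency-free, power iteration is swapped in to find the leading singular triplet, and a second power iteration on the residual estimates the next singular value. The input here is a small synthetic matrix, not a real data matrix, and all function names are hypothetical.

```python
import math


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def rmatvec(a, u):
    """Multiply the transpose of a by u."""
    return [sum(a[i][j] * u[i] for i in range(len(a))) for j in range(len(a[0]))]


def leading_singular_triplet(a, iters=200):
    """Power iteration on a^T a; stands in for a full SVD in this sketch."""
    n = len(a[0])
    v = [1.0 / math.sqrt(n)] * n  # deterministic start; assumed not orthogonal to the answer
    for _ in range(iters):
        w = rmatvec(a, matvec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    av = matvec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma > 0.0 else [0.0] * len(a)
    return sigma, u, v


def rank1_factorization(a):
    """Return (m, s, ratio) with a ~ m s^T and ratio ~ sigma_2 / sigma_1."""
    sigma1, u, v = leading_singular_triplet(a)
    m = [sigma1 * x for x in u]
    residual = [[a[i][j] - m[i] * v[j] for j in range(len(v))] for i in range(len(a))]
    sigma2, _, _ = leading_singular_triplet(residual)
    ratio = sigma2 / sigma1 if sigma1 > 0.0 else float("inf")
    return m, v, ratio
```

A small `ratio` plays the role of the check in Step 1: it indicates that one singular value dominates, consistent with approximately linear motion and a signal that stands out from the noise. In practice a library SVD routine would normally replace the power iteration.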

Linear-Motion Algorithm Analysis 

Step 2. 

From (4), $S^{(1)} \sim H\{Z^{-1}(\hat{T}_z p - [\hat{T}]_2) \cdot \nabla I\}$. Then since $H^T H$ is a projection matrix
annihilating the $\Psi_x, \Psi_y, \Psi_z$, it follows that

$$H^T S^{(1)} \sim \{Z^{-1}(\hat{T}_z p - [\hat{T}]_2) \cdot \nabla I\} + \Psi w \tag{8}$$

for some $w$. Since the matrix $P_\Omega$ annihilates the first term on the right hand side of (8),
[...] constraints, where [...] is the number of windows. Then applying $P_{\hat{T}}$ to (8) gives (6).

Because we omitted $2(N_I - 1)$ constraints, Step 2 gives a suboptimal estimate of
$\hat{T}$ and $Z_n^{-1}$. As before, one can base Step 2 on a multi-frame reestimate of the reference
image $I^0$, with, as before, the caveat that if the original noise in $I^0$ is less than that in the
recomputed $I^0$, one should use $I^0$ directly.
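
The final depth recovery of equation (7) is a per-pixel division once the direction of translation is known. The minimal sketch below assumes the left-hand side of (7) has already been computed for each pixel and is passed in as `rhs`; the function name and the threshold `eps` are illustrative assumptions.

```python
def recover_inverse_depths(rhs, points, grads, t_hat, eps=1e-8):
    """Solve rhs_n = Z_n^{-1} (t_z p_n - [t]_2) . grad_n for each inverse depth.

    rhs: per-pixel values of the left-hand side of equation (7).
    points: (x, y) per pixel; grads: (Ix, Iy) per pixel; t_hat: (tx, ty, tz).
    Returns None for pixels whose coefficient is too small to divide by.
    """
    tx, ty, tz = t_hat
    result = []
    for r, (x, y), (ix, iy) in zip(rhs, points, grads):
        # Translational flow direction at this pixel, dotted with the gradient.
        coeff = (tz * x - tx) * ix + (tz * y - ty) * iy
        result.append(r / coeff if abs(coeff) > eps else None)
    return result


z_inv = recover_inverse_depths(
    rhs=[-0.22, -0.05, 0.0],
    points=[(0.1, 0.2), (-0.3, 0.1), (0.0, 0.0)],
    grads=[(1.0, 0.5), (0.2, -0.4), (0.0, 1.0)],
    t_hat=(0.6, 0.0, 0.8),
)
```

Pixels where the intensity gradient is perpendicular to the translational flow direction carry no depth information in this equation, so the sketch returns `None` there instead of dividing by a near-zero coefficient.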

The linear-motion algorithm extends to deal with a camera translating on a plane
or in all 3D directions. The number of large singular values determines the
dimensionality of the motion; for example, planar motion corresponds to two large singular
values. For each large singular value, the corresponding singular vector gives rise to an
equation similar to (8), which can be solved as before for $\hat{T}$, where each singular vector
yields a different $\hat{T}$. One recovers the $Z_n^{-1}$ from the corresponding set of equations of the
form of (7), one for each large singular value.

Implementation 

It will be apparent to those skilled in the art that the methods of the present 
invention disclosed herein may be embodied and performed completely by software 
contained in an appropriate storage medium for controlling a computer. 

Fig. 1 illustrates in block-diagram form a computer hardware
system incorporating the invention. As indicated therein, the system includes a video
source 101, whose output is digitized into a pixel map by a digitizer 102. The digitized
video frames are then sent in electronic form via a system bus 103 to a storage device 104
for access by the main system memory during usage. During usage the operation of the
system is controlled by a central-processing unit (CPU) 105, which controls the access to
the digitized pixel map and the invention. The computer hardware system will include
those standard components well-known to those skilled in the art for accessing and
displaying data and graphics, such as a monitor 106 and graphics board 107.

The user interacts with the system by way of a keyboard 108 and/or a mouse 109
or other position-sensing device such as a track ball, which can be used to select items on
the screen or direct functions of the system.

The execution of the key tasks associated with the present invention is directed by
instructions stored in the main memory of the system, which is controlled by the CPU.
The CPU can access the main memory and perform the steps necessary to carry out the
method of the present invention in accordance with stored instructions that govern CPU
operation. Specifically, the CPU, in accordance with the input of a user, will access the
stored digitized video and, in accordance with the instructions embodied in the present
invention, will analyze the selected video images in order to extract the 3D structure
information from the associated digitized pixel maps.

Referring now to Fig. 2, the method of the present invention will be described in
relation to the block diagram. A first image in a sequence is taken by a camera at a
reference perspective and one or more successive images are taken by moving the camera
along a substantially linear path to one or more successive different perspectives in step
201. The images are then digitized 202 for analysis of the image content, i.e. image
intensities. From the digitized image content, image data shifts are determined for each
successive image 203 with respect to the reference image, the shifts being derived from
the camera translation and/or rotation from the reference perspective to the successive
different perspectives.

A shift data matrix that incorporates the image data shifts for each image is then
constructed 204. The shift data matrix is then used to calculate a rank-1 factorization
using SVD, one rank-1 factor being a vector corresponding to
the 3D structure and the other rank-1 factor being a vector corresponding to the
camera motion 205. The successive images are divided into smoothing windows 206 and
the camera motion is recovered from the factorization vectors between the smoothing
windows by solving a linear equation 207. Finally, the 3D structure is recovered by
solving a linear equation using the recovered camera motion 208.

While there has been shown and described what is considered to be preferred
embodiments of the invention, it will, of course, be understood that various modifications
and changes in the form or detail could readily be made without departing from the spirit
of the invention. It is therefore intended that the invention be not limited to the exact
forms described and illustrated, but should be construed to cover all modifications that
may fall within the scope of the appended claims.